Add dataproc tpcds example notebook #607
Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
Greptile Summary
This PR adds a Dataproc-specific TPC-DS benchmark notebook (
Confidence Score: 4/5
Safe to merge as an example notebook; all findings are cosmetic or dead-code cleanup that do not affect benchmark correctness.
The benchmark logic itself is sound: GPU/CPU runs are clearly separated, results are merged and plotted correctly, and the cluster setup instructions are complete. The issues found are limited to copy-paste residue: a wrong appName, an unused scala_version detection cell, and sparkmeasure being installed and configured at the cluster level without any actual usage in the notebook. The notebook TPCDS-SF3K-Dataproc.ipynb has the unused cells and wrong app name worth cleaning up before the example is widely shared.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Install packages\ntpcds_pyspark, sparkmeasure, pandas, matplotlib] --> B[Import modules]
B --> C[Detect Scala version from spark-sql JAR\n⚠️ result unused]
C --> D[Create SparkSession\nappName='NDS Example' ⚠️]
D --> E[Verify GPU acceleration\nspark.range + explain]
E --> F[Init TPCDS\ndata_path=gs://GCS_PATH_TO_TPCDS_DATA/]
F --> G[Register TPC-DS tables\ntpcds.map_tables]
G --> H[GPU Run\nspark.rapids.sql.enabled=True\ntpcds.run_TPCDS]
H --> I[CPU Run\nspark.rapids.sql.enabled=False\ntpcds.run_TPCDS]
I --> J[Merge results\ncompute speedup]
J --> K[Plot elapsed time comparison]
J --> L[Plot speedup factors]
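The "Merge results / compute speedup" step in the flowchart can be illustrated with a minimal sketch. This is not the notebook's actual code: the function name `merge_and_speedup`, the dictionary layout, and the sample timings are hypothetical, and plain dictionaries are used instead of pandas to keep the example dependency-free. It assumes each run yields a positive per-query elapsed time in seconds.

```python
def merge_and_speedup(gpu_times, cpu_times):
    """Join per-query timings and compute the CPU/GPU speedup.

    Both arguments map a query name to elapsed seconds (hypothetical layout).
    Only queries present in both runs are kept, in sorted order.
    """
    merged = {}
    for query in sorted(gpu_times.keys() & cpu_times.keys()):
        gpu_s = gpu_times[query]
        cpu_s = cpu_times[query]
        # Speedup > 1 means the GPU run was faster than the CPU run.
        merged[query] = {"gpu_s": gpu_s, "cpu_s": cpu_s, "speedup": cpu_s / gpu_s}
    return merged


# Made-up timings purely for illustration; q3 has no CPU result and is dropped.
gpu_times = {"q1": 2.0, "q2": 4.0, "q3": 1.0}
cpu_times = {"q1": 10.0, "q2": 6.0}

results = merge_and_speedup(gpu_times, cpu_times)
for query, row in results.items():
    print(f"{query}: cpu={row['cpu_s']}s gpu={row['gpu_s']}s speedup={row['speedup']:.2f}x")
```

Keeping only queries that appear in both runs avoids reporting a speedup for a query that failed or was skipped in one of the two configurations.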
Reviews (1): Last reviewed commit: "Clear a cell output"
2 files reviewed, 3 comments
"]\n",
"\n",
"demo_start = time.time()\n",
"tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
syntax: gs://gcs_bucket is a placeholder; it should be updated to match the $GCS_BUCKET variable pattern used in the README.
Suggested change:
- "tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
+ "tpcds = TPCDS(data_path='gs://$GCS_BUCKET/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
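One caveat with the suggested pattern: Python does not expand `$GCS_BUCKET` inside a string literal, so the value has to be substituted before it reaches `TPCDS(...)`. A minimal sketch of one way to do that follows. The helper name `tpcds_data_path` is hypothetical, and reading the bucket from a `GCS_BUCKET` environment variable is an assumption about how a user might mirror the README pattern, not something this PR implements.

```python
import os


def tpcds_data_path(bucket=None, subdir="parquet_sf3k_decimal"):
    """Build the gs:// data path from an explicit bucket or $GCS_BUCKET.

    Hypothetical helper: the bucket argument wins; otherwise the GCS_BUCKET
    environment variable is used. Raises if neither is available.
    """
    bucket = bucket or os.environ.get("GCS_BUCKET")
    if not bucket:
        raise ValueError("Set the GCS_BUCKET environment variable or pass bucket=...")
    # Normalize stray slashes so the result always has the same shape.
    return f"gs://{bucket.strip('/')}/{subdir.strip('/')}/"


# "example-bucket" is a placeholder name, not a real bucket.
print(tpcds_data_path("example-bucket"))
```

The returned string could then be passed as `data_path` in the `TPCDS(...)` call shown in the diff above.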
"text/html": [
"\n",
" <div>\n",
" <p><b>SparkSession - hive</b></p>\n",
" \n",
" <div>\n",
" <p><b>SparkContext</b></p>\n",
"\n",
" <p><a href=\"http://testbyhao2-ubuntu22-m.c.rapids-spark.internal:46705\">Spark UI</a></p>\n",
"\n",
" <dl>\n",
" <dt>Version</dt>\n",
" <dd><code>v3.5.3</code></dd>\n",
" <dt>Master</dt>\n",
" <dd><code>yarn</code></dd>\n",
" <dt>AppName</dt>\n",
" <dd><code>PySparkShell</code></dd>\n",
" </dl>\n",
" </div>\n",
" \n",
" </div>\n",
" "
],
Please clear the notebook output for the PR
Cleared all the output.
Please add a PR description.
Per our offline conversation, let us try to add knobs for hosted Spark and hosted data so that we can accommodate these use cases in the original TPC-DS notebook instead of adding a clone with few modifications. We will gradually expand the README in follow-up PRs to explain how to run this notebook on different cloud providers.
Please add a performance benchmark comparing CPU and GPU runs.
The request here is to provide a notebook specific to each environment, so users do not need to make any changes. Make it as simple as possible for the user. I understand that this will create maintenance overhead.
The PR already assumes CSP-specific instructions for launching it, if you look at the proposed README changes. I bet that, even without them, there are already enough specifics in the default environment to make minor adjustments that create minor CSP-specific logic in the notebook. If not, it can be part of the command documented for the user anyway.
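To make the "knobs" idea discussed above concrete, here is a minimal sketch of how a single notebook might read its environment-specific settings. Everything in it is hypothetical: the variable names `TPCDS_HOSTED_SPARK` and `TPCDS_DATA_PATH`, the `NotebookKnobs` class, and the defaults are illustrative assumptions, not part of this PR or of the existing TPC-DS notebook.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class NotebookKnobs:
    """Hypothetical per-environment settings for a shared notebook."""
    use_hosted_spark: bool  # True when the platform already provides Spark
    data_path: str          # Where the TPC-DS data lives (local path or gs://...)


def load_knobs(env: Mapping[str, str] = os.environ) -> NotebookKnobs:
    """Read knobs from environment variables, falling back to local defaults."""
    hosted = env.get("TPCDS_HOSTED_SPARK", "false").strip().lower() in ("1", "true", "yes")
    data_path = env.get("TPCDS_DATA_PATH", "./tpcds_data/")
    return NotebookKnobs(use_hosted_spark=hosted, data_path=data_path)


# Example: values a CSP-specific launch command might export (placeholders).
knobs = load_knobs({"TPCDS_HOSTED_SPARK": "true", "TPCDS_DATA_PATH": "gs://example-bucket/data/"})
print(knobs)
```

With this shape, the CSP-specific part would live in the documented launch command that exports the variables, while the notebook body stays the same across environments.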
NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.
NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.
Add an example TPC-DS notebook for GCP Dataproc.